Data sources

LODES

The network data for this project come from the Longitudinal Employer-Household Dynamics (LEHD) program, run by the Center for Economic Studies at the U.S. Census Bureau. LEHD maintains the LEHD Origin-Destination Employment Statistics (LODES), which record, for each year from 2002 to 2022, the number of people who live in one census block and work in another. The complete data represent a longitudinal, directed, weighted network with millions of nodes.

For this project, I’m only using the LODES for one city (Chicago) and one year (2022). The 01-fetch.R script fetches these data from the LODES download server (https://lehd.ces.census.gov/data/lodes/LODES8/).
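As a rough illustration of what the fetch step involves, the sketch below builds the download URL for one state's origin-destination file. The helper name lodes_od_url is hypothetical, and the file-naming pattern (state, part, job type, year) is an assumption about the LODES8 directory layout rather than something taken from 01-fetch.R, so it should be checked against the server before being relied on.

```r
# Hypothetical helper: build the URL of a LODES8 origin-destination (OD) file.
# Assumed naming pattern: <base>/<st>/od/<st>_od_<part>_<job_type>_<year>.csv.gz
lodes_od_url <- function(state = "il", year = 2022,
                         part = "main", job_type = "JT00") {
  base <- "https://lehd.ces.census.gov/data/lodes/LODES8"
  state <- tolower(state)
  file <- sprintf("%s_od_%s_%s_%d.csv.gz", state, part, job_type, as.integer(year))
  paste(base, state, "od", file, sep = "/")
}

url <- lodes_od_url("il", 2022)
print(url)

# To actually download and read the file (not run here):
# download.file(url, destfile = "data/il_od_main_JT00_2022.csv.gz", mode = "wb")
# od <- read.csv(gzfile("data/il_od_main_JT00_2022.csv.gz"),
#                colClasses = c(w_geocode = "character", h_geocode = "character"))
```

Reading the geocodes as character keeps any leading zeros and makes it easy to take substrings of the block IDs later.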

Additional data sources

The 01-fetch.R script fetches two additional data sources that are used in my analysis and visualizations. First, we need the census tract geographic shapes, which are provided by the TIGER API. Here is the documentation. Second, since my analysis occurs at a neighborhood level, we need a way to crosswalk from census tracts to Chicago community areas (CCAs). UChicago’s Spatial Data Resources provides a file with the boundaries of all Chicago community areas that we can use for this purpose.

Chicago has 77 community areas as of 2025.

Preprocessing

After 01-fetch.R saves these three data sets to files in the data/ directory, the 02-tracts.R and 03-community-areas.R scripts perform the following preprocessing steps:

  1. Aggregate commuting flows from the census block level to the census tract level.
  2. Determine each tract’s corresponding community area (CCA) by finding the community area that the tract’s centroid falls within.
  3. Aggregate commuting flows from the census tract level to the CCA level. All tracts outside Chicago are coded as "OUTSIDE CHICAGO".
  4. Compute some useful statistics for CCAs:
    1. Pairwise distances of all CCAs
    2. Total number of people who live in this CCA and work anywhere in Illinois
    3. Total number of people who live in this CCA and work in Chicago
    4. Number of people who both live and work in this CCA (i.e., don’t leave the CCA to work)
    5. Total number of people who work in this CCA
    6. Total number of people from Chicago who work in this CCA
    7. Average distance that workers from this CCA commute
    8. Median distance that workers from this CCA commute
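Step 2 above is a point-in-polygon test. In practice a spatial package would normally handle this; the sketch below instead uses a plain ray-casting test in base R as a stand-in, so that it runs without extra dependencies. The two square "areas", the tract centroids, and the function names are all hypothetical.

```r
# Ray-casting test: is the point (px, py) inside the polygon whose vertices
# are (vx[i], vy[i])? The polygon is closed implicitly (last vertex -> first).
point_in_polygon <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    # Does edge (j -> i) straddle the horizontal line y = py?
    if ((vy[i] > py) != (vy[j] > py)) {
      # x-coordinate where the edge crosses that line
      x_cross <- vx[j] + (py - vy[j]) * (vx[i] - vx[j]) / (vy[i] - vy[j])
      if (px < x_cross) inside <- !inside
    }
    j <- i
  }
  inside
}

# Hypothetical community areas: two unit squares side by side.
areas <- list(
  "AREA A" = list(x = c(0, 1, 1, 0), y = c(0, 0, 1, 1)),
  "AREA B" = list(x = c(1, 2, 2, 1), y = c(0, 0, 1, 1))
)

# Assign a centroid to the first area containing it, else "OUTSIDE CHICAGO".
assign_area <- function(px, py, areas) {
  for (nm in names(areas)) {
    if (point_in_polygon(px, py, areas[[nm]]$x, areas[[nm]]$y)) return(nm)
  }
  "OUTSIDE CHICAGO"
}

tract_cca <- c(
  t1 = assign_area(0.5, 0.5, areas),
  t2 = assign_area(1.5, 0.2, areas),
  t3 = assign_area(3.0, 0.5, areas)
)
print(tract_cca)
```

Assigning by centroid gives every tract exactly one community area, even when a tract boundary does not line up perfectly with a community area boundary.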

Finally, the processed data are saved to data/ccas.geojson and data/cca_flows.csv.
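The two aggregation steps (blocks to tracts, then tracts to CCAs) can be sketched on toy data as follows. This is a base-R illustration, not the code in 02-tracts.R or 03-community-areas.R. It assumes the LODES column names h_geocode, w_geocode, and S000 (total jobs), relies on the fact that a tract GEOID is the first 11 characters of a 15-character block GEOID, and uses a made-up crosswalk; the output column name n is chosen to match the edge attribute used in the ERGM section.

```r
# Toy block-level flows in the LODES OD layout. All values are made up.
od <- data.frame(
  h_geocode = c("170310101001000", "170310101001001",
                "170310102001000", "170310102001000"),
  w_geocode = c("170310102001000", "170310102002000",
                "170310101001000", "170979999001000"),
  S000 = c(3, 2, 5, 4),
  stringsAsFactors = FALSE
)

# Step 1: truncate block GEOIDs to tract GEOIDs and sum the flows.
od$h_tract <- substr(od$h_geocode, 1, 11)
od$w_tract <- substr(od$w_geocode, 1, 11)
tract_flows <- aggregate(S000 ~ h_tract + w_tract, data = od, FUN = sum)

# Hypothetical tract -> CCA crosswalk (in the project this comes from step 2).
crosswalk <- c("17031010100" = "AREA A", "17031010200" = "AREA B")
to_cca <- function(tract) {
  cca <- unname(crosswalk[tract])
  cca[is.na(cca)] <- "OUTSIDE CHICAGO"  # tracts missing from the crosswalk
  cca
}

# Step 3: relabel tracts by CCA and sum again.
tract_flows$from <- to_cca(tract_flows$h_tract)
tract_flows$to <- to_cca(tract_flows$w_tract)
cca_flows_toy <- aggregate(S000 ~ from + to, data = tract_flows, FUN = sum)
names(cca_flows_toy)[3] <- "n"
print(cca_flows_toy)
```

Because both steps are plain sums, the total number of jobs is preserved at every level of aggregation, which makes a convenient sanity check.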

Exploring the data

Where do people who live in a given CCA work?

One simple question this dataset can answer is where the residents of a given community area work. For example, below we compare the commuting flows for residents of Hyde Park and Woodlawn. The two neighborhoods are adjacent, but because Hyde Park is home to the University of Chicago, most of its residents work there. Woodlawn, on the other hand, has a shortage of local jobs, so most Woodlawn residents commute to the Loop.

In and out degree

In degree (number who work in this community area) is recorded in the w_total column:

## # A tibble: 6 × 2
##   name            w_total
##   <chr>             <int>
## 1 Loop             442777
## 2 Near North Side  201444
## 3 Near West Side   145641
## 4 Ohare             57408
## 5 West Town         32620
## 6 Hyde Park         27029

The community areas with the most jobs are those in the central city (the Loop, Near North Side, Near West Side, and West Town) or those that contain a major employer (O’Hare and Hyde Park).

Out degree (number of workers who live in this community area) is recorded in the h_total column:

## # A tibble: 10 × 2
##    name            h_total
##    <chr>             <int>
##  1 Lake View         53699
##  2 Near North Side   53596
##  3 West Town         46419
##  4 Logan Square      38107
##  5 Austin            35406
##  6 Lincoln Park      33749
##  7 Near West Side    31099
##  8 Belmont Cragin    28989
##  9 West Ridge        28168
## 10 Uptown            28028

Unlike in degree, out degree isn’t dominated by a few extreme outliers. However, notice the clear trend on the map: North Side neighborhoods tend to have more working residents than South Side neighborhoods, where unemployment tends to be higher.

Not surprisingly, neighborhoods with more working residents tend to also have more jobs.
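For reference, weighted in and out degree can be derived from a flow table like this. The sketch uses a toy table with assumed columns from, to, and n; the helper sum_by is hypothetical, and the real h_total and w_total columns may have been computed differently.

```r
# Toy CCA-level flows: `n` people live in `from` and work in `to`.
flows <- data.frame(
  from = c("A", "A", "B", "B", "C", "C"),
  to   = c("A", "B", "A", "B", "A", "B"),
  n    = c(10, 40, 5, 20, 30, 15)
)
ccas <- sort(unique(c(flows$from, flows$to)))

# Sum flows by a grouping column, returning 0 for areas with no flow.
sum_by <- function(values, groups, levels) {
  totals <- tapply(values, factor(groups, levels = levels), sum)
  totals[is.na(totals)] <- 0
  totals
}

degree <- data.frame(
  name = ccas,
  h_total = as.vector(sum_by(flows$n, flows$from, ccas)), # out degree
  w_total = as.vector(sum_by(flows$n, flows$to, ccas))    # in degree
)
print(degree[order(-degree$w_total), ])
```

Flows from an area to itself count toward both totals here, since those workers both live and work in the area.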

Eigenvector centrality

## # A tibble: 10 × 2
##    name            eig_centrality
##    <chr>                    <dbl>
##  1 Loop                     1    
##  2 Near North Side          0.847
##  3 Near West Side           0.478
##  4 Lake View                0.462
##  5 West Town                0.366
##  6 Lincoln Park             0.306
##  7 Logan Square             0.260
##  8 Uptown                   0.201
##  9 Edgewater                0.167
## 10 Austin                   0.154

The node with the highest eigenvector centrality is the Loop, followed by the neighborhoods north and west of the Loop. South Side community areas uniformly have very low eigenvector centrality.

The plot below highlights the relationships between eigenvector centrality and geographic location.
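This report does not show how eig_centrality was computed. As a minimal sketch of the idea, the code below runs power iteration on a toy weighted, directed flow matrix and rescales the scores so the largest is 1, like the table above. It assumes the convention that an area is central when it receives workers from other central areas; the matrix values and the function name are hypothetical.

```r
# A[i, j] = number of commuters who live in area i and work in area j (toy values).
A <- matrix(
  c( 0, 50,  5,
    20,  0,  5,
    30, 40,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "B", "C"), c("A", "B", "C"))
)

# Power iteration for the leading eigenvector of t(A): an area's score is
# proportional to the flow-weighted scores of the areas that send it workers.
eigen_centrality_power <- function(A, tol = 1e-10, max_iter = 1000) {
  x <- rep(1, nrow(A))
  for (iter in seq_len(max_iter)) {
    x_new <- as.vector(t(A) %*% x)
    x_new <- x_new / max(x_new)  # rescale so the top score is 1
    if (max(abs(x_new - x)) < tol) break
    x <- x_new
  }
  setNames(x_new, rownames(A))
}

centrality <- eigen_centrality_power(A)
print(round(centrality, 3))
```

In this toy network, area B draws the largest flows from the other areas and ends up with the top score, which mirrors the Loop's role in the real data.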

Local jobs availability

This plot shows the number of local jobs per working resident in each community area. In neighborhoods with the largest employment, like O’Hare and the Loop, there are many more jobs than working residents. By contrast, many neighborhoods have nearly ten times as many working residents as jobs, so almost everyone must commute.
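To make the ratio concrete, here it is for the three community areas that appear in both of the tables in the "In and out degree" section. This assumes "local jobs per working resident" means w_total divided by h_total, which the report does not state explicitly.

```r
# w_total (jobs located in the area) and h_total (working residents), copied
# from the two tables in the "In and out degree" section.
jobs <- data.frame(
  name = c("Near North Side", "Near West Side", "West Town"),
  w_total = c(201444, 145641, 32620),
  h_total = c(53596, 31099, 46419)
)

# Assumed definition: local jobs per working resident.
# Values above 1 mean the area has more jobs than working residents.
jobs$jobs_per_resident <- jobs$w_total / jobs$h_total
print(jobs[order(-jobs$jobs_per_resident), ])
```

Even within this short list the ratio falls on both sides of 1: Near North Side and Near West Side have several jobs per working resident, while West Town has fewer jobs than working residents.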

Distance and flow size

Intuitively, we would expect that commuting is higher between neighborhoods that are closer together.
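A quick way to check this expectation is a rank correlation between distance and flow size across dyads. The sketch below uses made-up dyads, not the Chicago data, and the column names distance and n are assumptions.

```r
# Hypothetical dyads: distance between two areas and the number of commuters.
dyads <- data.frame(
  distance = c(1, 2, 3, 5, 8, 12, 15, 20),
  n        = c(900, 400, 420, 150, 60, 35, 10, 0)
)

# Spearman's rank correlation only asks whether flows tend to fall as
# distance rises, without assuming a particular functional form.
rho <- cor(dyads$distance, dyads$n, method = "spearman")
print(rho)
```

A strongly negative value supports the intuition, but it does not account for the size of the origin and destination areas, which is part of the motivation for modeling the flows.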

Modeling Commuting Patterns with an ERGM

Modeling this dataset with an exponential random graph model (ERGM) would help us understand what factors determine where people work. However, the ERGMs we learned about in class only work with unweighted networks. Krivitsky (2012) extended the ERGM framework to networks whose edge values are counts, which is what we have here.

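library(tidyverse)
library(tidygraph)
library(intergraph) # asNetwork()
library(ergm)
library(ergm.count) # Poisson reference measure for count-valued ERGMs

# Drop the catch-all node so the model covers only flows within Chicago
chicago_only <- cca_graph %N>%
  filter(name != "OUTSIDE CHICAGO")

net <- asNetwork(chicago_only)

# Add node covariates
net %v% "residents_log" <- log(chicago_only %N>% pull(h_in_chicago))
net %v% "workers_log" <- log(chicago_only %N>% pull(w_from_chicago))

# Network covariate: geographic distances
# Make a full distance matrix. Recall that all dyads are represented in
# cca_flows, not just non-zero edges.
D <- cca_flows %>%
  filter(from != "OUTSIDE CHICAGO", to != "OUTSIDE CHICAGO") %>%
  select(from, to, distance) %>%
  pivot_wider(names_from = to, values_from = distance) %>%
  column_to_rownames("from") %>%
  as.matrix()

# Put rows and columns in the network's vertex order so that D[i, j]
# lines up with dyad (i, j)
cca_names <- chicago_only %N>% pull(name)
D <- D[cca_names, cca_names]

fit1 <- ergm(
  net ~
    sum + # baseline intensity
    nonzero + # sparsity
    nodeocov("residents_log") + # origin size effect
    nodeicov("workers_log") + # destination size effect
    edgecov(D), # deterrence by distance
  reference = ~Poisson,
  response = "n" # edge attribute holding the commuter counts
)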
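One way to build intuition for this specification: if the nonzero term is dropped, the remaining terms are all dyad-independent, and with a Poisson reference measure the model reduces to an ordinary Poisson log-linear regression on dyads. The sketch below simulates hypothetical dyad data and fits that simplified model with glm. It is not a substitute for fit1, and every value and coefficient in it is made up for illustration.

```r
set.seed(1)

# Hypothetical dyads among 20 areas (no self-flows), with made-up covariates.
k <- 20
dy <- expand.grid(from = seq_len(k), to = seq_len(k))
dy <- dy[dy$from != dy$to, ]
residents_log <- log(runif(k, 5e3, 5e4))  # stands in for log(h_in_chicago)
workers_log   <- log(runif(k, 1e3, 2e5))  # stands in for log(w_from_chicago)
dy$residents_log <- residents_log[dy$from]
dy$workers_log   <- workers_log[dy$to]
dy$distance      <- runif(nrow(dy), 1, 30)

# Simulate counts from a log-linear Poisson model with distance deterrence.
eta  <- -12 + 1.0 * dy$residents_log + 0.8 * dy$workers_log - 0.10 * dy$distance
dy$n <- rpois(nrow(dy), exp(eta))

# Dyad-independent analogue of sum + nodeocov + nodeicov + edgecov.
fit_glm <- glm(n ~ residents_log + workers_log + distance,
               family = poisson(link = "log"), data = dy)
print(round(coef(fit_glm), 3))
```

The fitted coefficients should land close to the simulated ones, with a negative distance effect. Comparing a fit like this against the ERGM coefficients is a cheap way to see how much the nonzero term changes the picture.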

Thinking forward

Plan

Research question: Do commuting patterns at the start of the 21st century help explain why some Chicago neighborhoods experienced greater economic improvement over the following two decades?

Dependent variable: Change in neighborhood socioeconomic status (SES), 2002-2022. Could be measured by:

  • Unemployment rate
  • Poverty rate
  • Median household income
  • Vacant lot share
  • Median/average rent
  • A composite index of some or all of the above

Independent variable: Commuting network measures. Could include:

  • Eigenvector centrality.
  • Outflow entropy (diversity).
  • Inbound job density. Number of inbound commuters per local resident.
  • Tie strength to high-opportunity areas: weighted sum of outbound commuters to high-income job zones.
  • Commuting asymmetry (workers in vs workers out)
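Outflow entropy is not defined further in this plan. One common reading is the Shannon entropy of the shares of an area's working residents going to each destination, sketched below with hypothetical counts. The function name outflow_entropy and the choice of natural logarithm are assumptions.

```r
# Shannon entropy (in nats) of one area's outflow distribution.
# `flows` is a vector of commuter counts from one origin to each destination.
outflow_entropy <- function(flows) {
  p <- flows[flows > 0] / sum(flows)  # zero flows contribute nothing
  -sum(p * log(p))
}

# Hypothetical origins: one sends everyone to a single destination,
# the other spreads its workers evenly over four.
concentrated <- c(100, 0, 0, 0)
even         <- c(25, 25, 25, 25)

print(outflow_entropy(concentrated))
print(outflow_entropy(even))
```

Entropy is zero when all workers share one destination and reaches its maximum, the log of the number of destinations, when they are spread evenly, so higher values indicate more diverse commuting ties.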

Questions and concerns:

  • Why does this actually matter? Don’t want this to just be another “oh look, there’s a correlation!” study
  • Theory behind any of these relationships?

Next steps:

  1. Modify data fetching code to get LODES for 2002.
  2. Fetch data (ACS? Census? CMAP?) for poverty rate and average rent in 2002 and 2022 (and years in between? why not).
  3. Compute dependent variables (change in poverty rate).
  4. Compute network independent variables.
  5. Fit an awesome model
  6. Interpret results